Tagging Unknown Words with Raw Text Features
نویسندگان
چکیده
Processing unknown words is disproportionately important because of their high information content. It is crucial in domains with specialist vocabularies where relevant training material is scarce, for example: biological text. Unknown word processing often begins with Part of Speech (POS) tagging, where accuracy is typically 10% worse than on known words. We demonstrate that features extracted from large raw text corpora can significantly increase accuracy on unknown words. These features supply a large part of what we are missing with unknown words: context information about how the word is used. We describe a Maximum Entropy modelling approach which uses real-valued features to represent unannotated contextual information. Our initial experiments with real-valued features have resulted in an increased accuracy from 87.39% to 88.85% on unknown words.
منابع مشابه
Tigrinya Part-of-Speech Tagging with Morphological Patterns and the New Nagaoka Tigrinya CorpusTigrinya Part-of-Speech Tagging with Morphological Patterns and the New Nagaoka Tigrinya Corpus
This paper presents the first part-of-speech (POS) tagging research for Tigrinya (Semitic language) from the newly constructed Nagaoka Tigrinya Corpus. The raw text was extracted from a newspaper published in Eritrea in the Tigrinya language. This initial corpus was cleaned and formatted in plaintext and the Text Encoding Initiative (TEI) XML format. A tagset of 73 tags was designed, and the co...
متن کاملMultilingual Word Segmentation and Part - of - Speech Tagging : a Machine Learning Approach Incorporating Diverse Features ∗
The aim of this dissertation is to study statistical methods for multilingual word segmentation and POS tagging with high accuracy. Word segmentation and part-of-speech (POS) tagging are fundamental language analysis tasks in natural language processing, and used in many applications. Existence of unknown words is a large problem in these tasks and they need to be properly handled. We attempt t...
متن کاملReduce Meaningless Words for Joint Chinese Word Segmentation and Part-of-speech Tagging
Conventional statistics-based methods for joint Chinese word segmentation and partof-speech tagging (S&T) have generalization ability to recognize new words that do not appear in the training data. An undesirable side effect is that a number of meaningless words will be incorrectly created. We propose an effective and efficient framework for S&T that introduces features to significantly reduce ...
متن کاملبرچسبگذاری ادات سخن زبان فارسی با استفاده از مدل شبکۀ فازی
Part of speech tagging (POS tagging) is an ongoing research in natural language processing (NLP) applications. The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The purpose of POS tagging is determining the grammatical ...
متن کاملHybrid Models for Chinese Unknown Word Resolution Dissertation
Word segmentation, part-of-speech (POS) tagging, and sense tagging are important steps in various Chinese natural language processing (CNLP) systems. Unknown words, i.e., words that are not in the dictionary or training data used in a CNLP system, constitute a major challenge for each of these steps. This dissertation is concerned with developing hybrid models that effectively combine statistic...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2005